Prosper is an online loan platform. The data refers to loans issued from July 2009 onwards. The dataset has many features varying in kind and scope, some referring to characteristics of the loan itself and others to the borrowers.
To get more general information about the dataset, I will run some basic functions.
## [1] 113937 81
81 features is a lot of columns! The dataset contains almost 114,000 observations across 81 variables. For my analysis I will explore only a fraction of these variables, focusing solely on those that I found interesting. More specifically, I will investigate the following:
“BorrowerRate”, “ProsperRating..Alpha.”, “ProsperScore”, “BorrowerState”, “ListingCategory..numeric.”, “CurrentDelinquencies”, “DelinquenciesLast7Years”, “StatedMonthlyIncome”, “IncomeRange”, “IsBorrowerHomeowner”, “TradesNeverDelinquent..percentage.”, “EmploymentStatus”.
## 'data.frame': 113937 obs. of 12 variables:
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ ProsperRating..Alpha. : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
## $ ListingCategory..numeric. : int 0 2 0 16 2 1 1 2 7 7 ...
## $ CurrentDelinquencies : int 2 0 1 4 0 0 0 0 0 0 ...
## $ DelinquenciesLast7Years : int 4 0 0 14 0 0 0 0 0 0 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
## $ TradesNeverDelinquent..percentage.: num 0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
I preferred to create a separate data frame with the variables of choice. The 12 variables selected are of different kinds. Some are categorical: “BorrowerState”, “IncomeRange”, “IsBorrowerHomeowner”, “ProsperRating..Alpha.” and “EmploymentStatus” (all stored as factors), while “ListingCategory..numeric.” is a numeric code for a category. The rest appear to be interval variables, assuming the intervals between values in ProsperScore are evenly spaced. In any case, we should investigate further to gain a better understanding of each.
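The subsetting step described above can be sketched as follows. This is a minimal illustration in base R, not the original code: the `loans` data frame is a three-row toy stand-in with made-up values, and `loans_sub` and `keep` are hypothetical names.

```r
# Toy stand-in for the full Prosper data (made-up values, 3 rows only).
loans <- data.frame(
  ListingKey = c("a1", "b2", "c3"),
  BorrowerRate = c(0.158, 0.092, 0.275),
  ProsperScore = c(NA, 7, NA),
  IsBorrowerHomeowner = factor(c("True", "False", "False")),
  Term = c(36, 36, 60)
)

# Columns of interest (a shortened version of the 12 listed above).
keep <- c("BorrowerRate", "ProsperScore", "IsBorrowerHomeowner")

# Fail early if a column name is misspelled, then subset by name.
stopifnot(all(keep %in% names(loans)))
loans_sub <- loans[, keep]

dim(loans_sub)  # rows and columns of the subset
str(loans_sub)  # type of each retained column
```

Selecting by name rather than by position keeps the subset stable if the column order of the source file changes.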
## BorrowerRate ProsperRating..Alpha. ProsperScore BorrowerState
## Min. :0.0000 :29084 Min. : 1.00 CA :14717
## 1st Qu.:0.1340 C :18345 1st Qu.: 4.00 TX : 6842
## Median :0.1840 B :15581 Median : 6.00 NY : 6729
## Mean :0.1928 A :14551 Mean : 5.95 FL : 6720
## 3rd Qu.:0.2500 D :14274 3rd Qu.: 8.00 IL : 5921
## Max. :0.4975 E : 9795 Max. :11.00 : 5515
## (Other):12307 NA's :29084 (Other):67493
## ListingCategory..numeric. CurrentDelinquencies DelinquenciesLast7Years
## Min. : 0.000 Min. : 0.0000 Min. : 0.000
## 1st Qu.: 1.000 1st Qu.: 0.0000 1st Qu.: 0.000
## Median : 1.000 Median : 0.0000 Median : 0.000
## Mean : 2.774 Mean : 0.5921 Mean : 4.155
## 3rd Qu.: 3.000 3rd Qu.: 0.0000 3rd Qu.: 3.000
## Max. :20.000 Max. :83.0000 Max. :99.000
## NA's :697 NA's :990
## StatedMonthlyIncome IncomeRange IsBorrowerHomeowner
## Min. : 0 $25,000-49,999:32192 False:56459
## 1st Qu.: 3200 $50,000-74,999:31050 True :57478
## Median : 4667 $100,000+ :17337
## Mean : 5608 $75,000-99,999:16916
## 3rd Qu.: 6825 Not displayed : 7741
## Max. :1750003 $1-24,999 : 7274
## (Other) : 1427
## TradesNeverDelinquent..percentage. EmploymentStatus
## Min. :0.000 Employed :67322
## 1st Qu.:0.820 Full-time :26355
## Median :0.940 Self-employed: 6134
## Mean :0.886 Not available: 5347
## 3rd Qu.:1.000 Other : 3806
## Max. :1.000 : 2255
## NA's :7544 (Other) : 2718
Above I drew some very basic summary statistics. It is useful to have these at hand while we explore the different features and correlations.
Borrower Rate - The plot shows something unexpected: a spike at around 32% interest. There are many more loans with a ~32% borrower rate than with any other rate. Hopefully we can learn why with further exploration. I excluded the top 0.1% of rates from the plot, as I deemed them outliers.
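One way to drop the top 0.1% of rates and locate a spike is sketched below. It is a base R illustration on synthetic rates (a uniform spread, an artificial cluster at 0.315 and one extreme value), not the original plotting code; the 1% bin width is an assumption.

```r
set.seed(1)

# Synthetic rates: a uniform spread, an artificial cluster, one extreme value.
rates <- c(runif(800, 0.05, 0.36), rep(0.315, 199), 0.4975)

# Keep everything at or below the 99.9th percentile.
cutoff <- quantile(rates, 0.999)
trimmed <- rates[rates <= cutoff]

# Bin the trimmed rates in 1% steps without drawing anything.
h <- hist(trimmed, breaks = seq(0, 0.5, by = 0.01), plot = FALSE)

# Midpoint of the most populated bin, i.e. where the spike sits.
spike_mid <- h$mids[which.max(h$counts)]
spike_mid
```

Calling `hist()` with the same arguments but without `plot = FALSE` would draw the histogram itself.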
ProsperRating..Alpha. - The Prosper Rating assigned at the time the listing was created:
N/A, HR, E, D, C, B, A, AA.
It is important to state that the Prosper Rating is a different variable from the Prosper Score. We do not yet know how or why they differ, as they look similar in aim, so we should further explore their relationship.
ProsperScore - The description says “ProsperScore is a custom risk score built using historical Prosper data. The score ranges from 1-10, with 10 being the best, or lowest risk score.” This is inconsistent with the data: the maximum is 11, and the plot shows it is no outlier. The Prosper Score has been treated here as a factor.
It will be useful to compare ProsperScore with ProsperRating..Alpha. later in the report, as their purpose seems to be the same.
BorrowerState - There are about twice as many loans from borrowers in CA as from any other state. There are many possible reasons for this: perhaps Prosper is best known in California, or perhaps people in some states tend to borrow more, and so on. Most importantly, California is the most populous state in the US. There are 52 levels, as the variable includes DC and an unspecified value.
ListingCategory - The description states: The category of the listing that the borrower selected when posting their listing:
0 - Not Available, 1 - Debt Consolidation, 2 - Home Improvement, 3 - Business, 4 - Personal Loan, 5 - Student Use, 6 - Auto, 7- Other, 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans, 13 - Household Expenses, 14 - Large Purchases, 15 - Medical/Dental, 16 - Motorcycle, 17 - RV, 18 - Taxes, 19 - Vacation, 20 - Wedding Loans
I was disappointed by the outcome: I expected to see the specific purposes for which people borrow money. However, the vast majority ask for a loan to consolidate (restructure or rearrange) previously issued debt, which tells us nothing about why they owed money in the first place. Likewise, the second and third highest bins read “Not Available” and “Other”, which are far from informative.
I decided to plot the bins in ascending order of count (rather than in the logical order 0-20) because it is useful to see how certain value counts compare to others.
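Ordering the bins by count rather than by code can be sketched as follows. This is a base R illustration on a handful of made-up category codes; `codes`, `labels` and `counts` are hypothetical names, and only four of the 21 categories are used.

```r
# Made-up listing category codes (only four of the 21 categories appear).
codes <- c(1, 1, 1, 0, 0, 7, 2, 1, 7, 0, 1)

# Lookup from numeric code to the label given in the variable description.
labels <- c("0" = "Not Available", "1" = "Debt Consolidation",
            "2" = "Home Improvement", "7" = "Other")

# Count each code, attach readable labels, then sort ascending by count.
tab <- table(codes)
counts <- setNames(as.vector(tab), unname(labels[names(tab)]))
counts <- sort(counts)

counts
```

Passing `counts` to `barplot(counts, horiz = TRUE, las = 1)` would then draw the bars in that order.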
CurrentDelinquencies - I had to plot this on a log scale, as the distribution is extremely skewed. Almost all loans are issued to people who have no current delinquencies.
DelinquenciesLast7Years - A similar skewness appears here. Almost everyone had 0 delinquencies in their recent history. Although the rest of the data (the non-zero values) looks centered around 10 in a roughly normal fashion, we should keep in mind that it is not so “normal”, because it is plotted on a log scale.
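To illustrate why a log scale helps with this kind of distribution, here is a minimal base R sketch on made-up delinquency counts. Taking `log10` of the frequencies is one possible approach and an assumption on my part; it is not necessarily the transformation used for the plots above.

```r
# Made-up delinquency counts: 900 borrowers with none, a thin tail of others.
delinq <- c(rep(0, 900), rep(1, 60), rep(2, 25), rep(5, 10), rep(12, 5))

# Share of borrowers with no delinquencies at all.
prop_zero <- mean(delinq == 0)

# Frequency of each value, then the same frequencies on a log10 scale.
counts <- table(delinq)
log_counts <- setNames(log10(as.vector(counts)), names(counts))

prop_zero
round(log_counts, 2)
```

On the raw scale the bar for 0 dwarfs everything else; on the log scale the tail values become comparable to each other.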
TradesNeverDelinquent..percentage. - Consistent with the above, since almost everyone has no delinquencies (current or past), the vast majority of loans have been issued to people with a high share of trades that were never delinquent. We must remember that a delinquent trade is a trade performed beyond the delinquency line set for that specific line of credit.
StatedMonthlyIncome - Self-explanatory. The mode is 3,500 USD. I arbitrarily capped the data at 20,000 USD, as there were extreme outliers: borrowers with six-digit monthly incomes.
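Capping the income and finding its most frequent value could look like the sketch below. It uses base R on a few made-up incomes, and `capped` and `mode_income` are hypothetical names; the mode quoted above may have been read off a binned plot instead.

```r
# Made-up monthly incomes, including one extreme outlier.
income <- c(3500, 3500, 4667, 2083, 6125, 3500, 9583, 1750003)

# Keep only incomes up to the 20,000 USD cap.
capped <- income[income <= 20000]

# Mode: the value with the highest frequency among the capped incomes.
tab <- table(capped)
mode_income <- as.numeric(names(tab)[which.max(tab)])

length(capped)
mode_income
```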
IncomeRange - The bulk of the data lies between 25k USD and 75k USD, which is within the range that I expected.
EmploymentStatus - The categories are hard to interpret. It is not clear why “Employed” should be different from “Full-time”, for instance; it would make more sense to consider “Full-time” a subcategory of “Employed”. It is possible that the data collected from the borrower was often not specific enough to indicate the kind of employment.
IsBorrowerHomeowner - I believe it does not make much sense to explore this variable visually, as the summary table already does a good job of telling us how the dataset is distributed:
57478 Borrowers do own a house.
56459 Borrowers do not own a house.
The dataset I have subsetted has 113937 listings of loans from the Prosper Loan platform. There are 12 features: “BorrowerRate”, “ProsperRating..Alpha.”, “ProsperScore”, “BorrowerState”, “ListingCategory..numeric.”, “CurrentDelinquencies”, “DelinquenciesLast7Years”, “StatedMonthlyIncome”, “IncomeRange”, “IsBorrowerHomeowner”, “TradesNeverDelinquent..percentage.”, “EmploymentStatus”.
The main features of interest in my EDA will be the Borrower Rate together with the Prosper Rating and the Prosper Score. I want to explore the correlations between these variables and see how they interact with other features related to the personal aspects of a borrower’s life (more below).
All the other selected variables tell us something about the personal life of the borrower (e.g. where he lives, his income, why he needs the money, if he owns a house, if he is a good borrower and a timely payer). I think all this information might be correlated with the rate at which the borrower pays back money and the rating/score he has.
No, however, I plan on readjusting the levels of some categorical variables along the way.
As shown, some distributions were extremely skewed. In a few instances I used a different scale to plot the data in a more meaningful way. In other instances, odd distributions in categorical data were due to bad data or to other issues unknown to me.
In some visualizations I have rearranged the order of the entries. For IncomeRange this was necessary, as the default visualization used a non-progressive order of the levels.
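Reordering the IncomeRange levels can be sketched as follows in base R. The values are made up and use only levels that appear in the summary output above; placing “Not displayed” first is an arbitrary choice for this illustration.

```r
# Made-up IncomeRange values; factor() sorts the levels alphabetically.
income_range <- factor(c("$25,000-49,999", "$100,000+", "$1-24,999",
                         "$50,000-74,999", "$0", "$75,000-99,999",
                         "Not displayed", "$25,000-49,999"))
levels(income_range)  # default order, not progressive

# Desired progressive order ("Not displayed" placed first by choice).
ordered_levels <- c("Not displayed", "$0", "$1-24,999", "$25,000-49,999",
                    "$50,000-74,999", "$75,000-99,999", "$100,000+")

# Re-create the factor with explicit levels; the data values are unchanged.
income_range <- factor(income_range, levels = ordered_levels)
table(income_range)
```

Plotting functions that respect factor levels will then draw the bins in this order.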
In BorrowerState and ListingCategory..numeric. I adjusted the order to make the distribution of data more visible and comparable.
One of the first things I wanted to investigate is the correlation between past and present delinquencies. From the plot above it would seem that the more you have been delinquent in the past, the more you tend to be today. However, we should keep in mind that we are not performing any statistical test to determine the significance of this relationship.
Does being delinquent affect the rate at which you borrow money? It looks like there is some positive correlation between your borrower rate and both your present and past behaviour in terms of delinquencies. Again, we would need to perform a statistical test to support our suspicions.
I arbitrarily capped the data at 30 and 70 delinquencies (current and past respectively) in order to eliminate outliers.
A useful legend:
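The kind of statistical test mentioned above could be sketched as follows. This is a base R illustration on simulated counts, not a result from the Prosper data; the Spearman rank correlation is my choice here because delinquency counts are skewed and full of ties.

```r
set.seed(42)
n <- 500

# Simulated counts: current delinquencies depend loosely on past ones.
past <- rpois(n, lambda = 4)
current <- rpois(n, lambda = 0.2 + 0.3 * past)

# Apply the same caps as in the plots (70 past, 30 current).
keep <- past <= 70 & current <= 30

# Spearman rank correlation; exact = FALSE because the data contain ties.
result <- cor.test(past[keep], current[keep],
                   method = "spearman", exact = FALSE)

result$estimate  # rank correlation coefficient (rho)
result$p.value   # p-value for the null hypothesis of no association
```

With a sample as large as the full dataset, even a weak correlation can come out as statistically significant, so the size of the coefficient matters as much as the p-value.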
0 - Not Available, 1 - Debt Consolidation, 2 - Home Improvement, 3 - Business, 4 - Personal Loan, 5 - Student Use, 6 - Auto, 7- Other, 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans, 13 - Household Expenses, 14 - Large Purchases, 15 - Medical/Dental, 16 - Motorcycle, 17 - RV, 18 - Taxes, 19 - Vacation, 20 - Wedding Loans
I was struck by the difference in mean and median rate between categories 9 and 10. However, before jumping to the conclusion that it is better to take a loan for a catamaran than for a nose job, we should take a closer look at the ListingCategory..numeric. univariate plot above: 9 and 10 are bins with very few observations, so this large difference in rate may be the result of chance (i.e. not statistically significant).
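Putting the group size next to each group mean makes such small bins easy to spot. The sketch below uses base R on made-up rates; the threshold of 3 observations is only for this toy example.

```r
# Made-up rates for four listing categories; 9 and 10 are nearly empty.
loans <- data.frame(
  category = c(1, 1, 1, 1, 1, 1, 9, 10, 10, 7, 7, 7),
  rate = c(0.18, 0.20, 0.22, 0.19, 0.21, 0.20,
           0.12, 0.28, 0.30, 0.21, 0.23, 0.25)
)

# Group size and mean rate per category.
n <- tapply(loans$rate, loans$category, length)
mean_rate <- tapply(loans$rate, loans$category, mean)
by_cat <- data.frame(category = names(n),
                     n = as.vector(n),
                     mean_rate = as.vector(mean_rate))

# Flag categories with too few observations to trust their mean.
by_cat$too_small <- by_cat$n < 3
by_cat
```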
Another important thing to notice is that the very frequent 32% rate is not due to a specific category of loans.
This is interesting. The Borrower Rate shows a very similar kind of correlation with both Prosper Rating and Prosper Score. This in turn raises another question: if both these variables measure virtually the same thing, why would Prosper need both? Maybe they wanted a more granular view of the same kind of information, using a scale from 1 to 11 instead of one from HR to AA (7 levels). However, we cannot say exactly why, nor do we know why, on average, a score of 5 is worse than a score of 4 when it comes to Borrower Rate.
As one would expect, it seems that on average, it is better to own a house when asking for a loan. However, the difference is very slight. I believe that the ownership of a house is an interesting feature and I would like to investigate it more in correlation with other variables.
The higher the Prosper Score, the higher the percentage of positive behavior in terms of delinquencies. There is an interesting aspect to notice in the plot: the entries that do not present a Score (labeled NA in the plot) hold the worst TradesNeverDelinquent ratio. This fact makes me question the idea that the unlabeled data was merely bad data, or entries that ProsperLoan failed to categorize with a specific score. From this perspective, “NA” seems like an actual bin, with mean and median just a bit below the Score 1.
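Treating the missing scores as a bin of their own can be sketched as follows. This uses base R on made-up values; `score_lab` and `means` are hypothetical names, and the explicit "NA" label is one of several ways to keep missing values in a grouped summary.

```r
# Made-up scores (some missing) and never-delinquent trade percentages.
score <- c(NA, 7, NA, 9, 4, 10, 2, 4)
tnd <- c(0.60, 1.00, 0.55, 0.95, 0.90, 1.00, 0.68, 0.80)

# Give missing scores an explicit "NA" label so they form their own group.
score_lab <- ifelse(is.na(score), "NA", as.character(score))

# Order the levels numerically, with the "NA" group last.
lev <- c(as.character(sort(unique(score[!is.na(score)]))), "NA")
score_lab <- factor(score_lab, levels = lev)

# Mean never-delinquent percentage per score group, "NA" included.
means <- tapply(tnd, score_lab, mean)
means
```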
We have discovered some interesting correlations. To summarize a few:
As delinquencies (present and past) increase, so does the Borrower Rate.
As Prosper Score and Prosper Rating increase, the Borrower Rate decreases.
As the proportion of non-delinquent trades over total trades approaches 100%, the Prosper Score increases.
Yes, amongst the features that lay outside the main variables of interest, I found that past and current delinquencies are positively correlated with each other.
The relation between BorrowerRate and ProsperRating appears to be the strongest and the most consistent. Moreover, the symmetry of ProsperScore and ProsperRating is evident, especially when we relate them to BorrowerRate. However, we must add that we have not compared the rating and the score with all the variables of the dataset, so some differences might exist that are not visible so far.
In the previous section we said that we wanted to investigate the IsBorrowerHomeowner feature further, together with other variables that appeared to be highly correlated with BorrowerRate. Above you can see that the difference (in terms of interest rate) between owning a house or not is almost always minimal, across all levels of rating. The one large difference lies in the unlabeled data.
However, in the plot right below, we see that owning a house does make a difference when looking at the interest rate that a borrower is charged, against their IncomeRange.
I wanted to substitute the IncomeRange variable with a non-categorical one. This allows us to draw a different kind of plot while gathering more information on essentially the same question.
The plot above shows a lot of overplotting, notwithstanding the reduced alpha. However, we can definitely see that homeowners typically tend to make more money than those who do not own a house. The plot right below is extremely helpful, as we can see that the black and red high-density levels, located in the southern region of the plot, have a non-negligible difference in terms of BorrowerRate.
We said above that ProsperScore looks very similar to ProsperRating. They appear to measure the same thing. However, the advantage of ProsperScore lies in its numeric nature: we are not forced to treat it as a factor, which in turn allows us to easily plot it against a categorical variable.
Above we see that, within this dataset, owning a house might be harmful depending on your employment status. This plot actually raises more questions than it answers: why, for instance, would your score benefit from not having a house if you are jobless or employed part-time?
I decided to draw this plot primarily to shed more light on the suspicion I raised above, when I said that the unlabeled data (in terms of Score) looked like yet another bin for very poor performers in terms of TradesNeverDelinquent. As we said before, Rating and Score look very similar in aim, so I decided to check whether this pattern emerges here as well. As we can see, the darkest stripe is not HR, but the unlabeled one. This plot also reveals that the highest rated loans are almost never issued to people without an income (or with a very low one). Moreover, as one would expect, there are fewer entries with a high income at the lowest levels of rating.
The most interesting finding in this section, in my opinion, is that owning a house does not always reduce the borrower interest rate. One would expect that owning a house would always decrease the risk for the lender and in turn the interest rate, but if you are unemployed, within this dataset you would receive a higher rate on average. It is also clear that the variables ProsperScore, BorrowerRate and ProsperRating..Alpha. are strongly correlated with each other: we have said several times that Score and Rating seem to have a similar aim, and indeed you see how people who are not employed have a lower score if they own a house.
The surprising interactions were largely outlined above. However, an extra finding worth noting concerns the spike in the distribution that we noticed around the 32% interest rate in the univariate section, for which we could not find an immediate explanation. It seems clear now that almost all the HR loans hover around that rate, as if it were the “go-to” interest rate for borrowers labeled HR.
I decided to focus on this plot because the density curves allow us to identify where the data lumps together: we see that there are three peaks of distribution around three levels of rates. These peaks are at around the same levels for both owners and non-owners, although the southernmost peak highlights a difference between those who own a house and those who do not. I believe the curves were an effective way to avoid the problems of overplotting.
These two plots were useful to identify a trend within the dataset: how delinquencies correlate with the interest rate of the borrowers. In general, the higher the delinquencies (both past and current), the higher the rate. At very high levels of delinquencies, however, the data becomes more and more sparse and the trend line is no longer clear or accurate.
This plot is by far the most interesting in my opinion. It helps us notice that the borrowers with the highest ratings very rarely have a 0 USD monthly income. More importantly, the “purest” shades of red belong to the best ratings. A non-trivial finding is that unlabeled data ranked as the worst performer in terms of delinquencies. I believe this to be an added value, as it allows us to shed light on data that held no label.
I started this project with the aim of exploring how some of the personal characteristics of the borrowers might correlate with the interest rate and the rating of the borrower. I also wanted to confirm some of the beliefs I had, which in a few cases were disproved (for instance, the importance of owning a house in bargaining for a lower interest rate).
I went through several technical struggles. I had to start over several times, as the features I had initially selected did not hold interesting correlations with each other, and ultimately it took me a lot of trial and error before I could find the features that, if explored together, would yield some meaningful insight.
Overall, it was great to learn how powerful R can be in drawing compelling visualizations in so few (sometimes one) lines of code.
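A quick way to check whether a rate spike is concentrated in one rating is to compute, per rating, the share of loans near the spike. The sketch below uses base R on made-up ratings and rates; the band of one percentage point around 0.32 is an assumption.

```r
# Made-up ratings and rates; several HR loans sit close to 0.32.
rating <- c("HR", "HR", "HR", "HR", "E", "D", "C", "A")
rate <- c(0.3177, 0.3177, 0.3199, 0.29, 0.30, 0.25, 0.18, 0.09)

# TRUE where the rate lies within one percentage point of 0.32.
near_spike <- abs(rate - 0.32) <= 0.01

# Share of loans near the spike, per rating.
share <- tapply(near_spike, rating, mean)
share
```

A share close to 1 for a single rating and close to 0 for the others would support the reading that the spike belongs to that rating.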